Natural Language Processing - Mini-Challenge 2: Sentiment Analysis

For information about this mini-challenge, please view the document NPR-Mini-Challenge-2-Sentiment-Analysis.pdf in the docs directory.

1. Setup and Data Selection

Dataset Overview

We use the Stanford Sentiment Treebank (SST-2), a benchmark dataset for sentiment classification. SST-2 consists of 11'855 single sentences extracted from movie reviews and parsed with the Stanford parser. Each sentence is binary-labeled as positive or negative; neutral phrases are excluded. The dataset is a well-established resource for evaluating sentiment classification models in natural language processing.

Practical Considerations

In real-world scenarios, it is common to have a small amount of labeled data and a large pool of unlabeled data. To emulate this condition while still leveraging weak labeling techniques effectively, we keep only a small labeled training sample (5'000 sentences) and treat the remaining training data as an unlabeled pool.

This setup ensures a clear separation of datasets for training and evaluation, allowing us to analyze the impact of weak labeling on model performance. By mimicking real-world constraints, this approach balances the need for practical experimentation with efficient use of available data.

Hierarchical Nested Splits

To analyze model performance variability and the effect of training data size, we create hierarchical nested splits from the sampled training data, ranging from 1% to 100%, where each smaller split is a subset of every larger one.

For better organization and reproducibility, the hierarchical splits are saved in a structured folder format.
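
The nesting idea can be sketched in a few lines of numpy (the fractions, helper name, and sample count here are illustrative, not the project's actual code): a single shared shuffle is truncated at each fraction, so every smaller split is automatically contained in each larger one.

```python
import numpy as np

def make_nested_splits(n_samples, fractions, seed=0):
    """Create hierarchical nested splits: one shared shuffle, truncated
    at each fraction, so each smaller split is a subset of larger ones."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)  # one shuffle shared by all splits
    return {f: order[: max(1, round(f * n_samples))] for f in sorted(fractions)}

splits = make_nested_splits(5000, [0.01, 0.05, 0.30, 1.00])
```

Because the shuffle happens once, comparing model performance across split sizes measures the effect of data quantity alone, not of resampling.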

Now that we have loaded our data, we can start building our baseline model.


2. Baseline Model

In this task, we apply full fine-tuning: all layers of a pre-trained model (bert-base-uncased) are retrained to adapt to the sentiment classification task. This approach is suitable given the hierarchical data splits ranging from 1% to 100% and the need to incorporate weak labels effectively. Fine-tuning lets us fully leverage the model's pre-trained knowledge while tailoring it to our specific dataset.

Methodology

  1. Model:

    • We use BertForSequenceClassification for fine-tuning. The model's pre-trained transformer layers and the classification head are optimized for binary sentiment analysis.
  2. Data Splits:

    • Hierarchical nested splits ranging from 1% to 100% of the training data are created.

    • Multiple sets of these splits are generated to test the variability in model performance across randomized subsets.

  3. Training:

    • Tokenized input sentences are passed to the model.

    • Each split is trained individually to evaluate performance at different data sizes.

  4. Metrics:

    • Metrics such as accuracy, precision, recall, and F1-score are recorded for each split.

    • Variability in performance across different split sets is analyzed.

  5. Comparison:

    • Performance is compared across data splits, with and without weak labels, to gain meaningful insights into the impact of labeled data size and weak labeling.
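
The metric recording in step 4 could look like this scikit-learn sketch (the helper name and example labels are illustrative):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate(y_true, y_pred):
    """Record accuracy, precision, recall, and F1 for one split."""
    acc = accuracy_score(y_true, y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary", zero_division=0
    )
    return {"accuracy": acc, "precision": prec, "recall": rec, "f1": f1}

m = evaluate([1, 0, 1, 1], [1, 0, 0, 1])
```

Calling `evaluate` once per split and per split set yields the table of metrics whose variability is analyzed below.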

Observations from the Plot

The graph effectively demonstrates the importance of data quantity for model performance and highlights diminishing returns as the data volume approaches the total available dataset.

Hyperparameter Tuning

Observations


3. Text Embeddings

To represent texts numerically, we use sentence embeddings generated by the all-MiniLM-L6-v2 model from the Sentence-Transformers library. These embeddings capture the semantic meaning of entire sentences as high-dimensional vectors, enabling us to quantify similarities between texts. By focusing on the validation and test sets, we ensure an unbiased evaluation of how well embeddings generalize to unseen data.

Are the embeddings standardized?

If the mean is close to 0 and the standard deviation is close to 1, then the embeddings are standardized. Otherwise, they are not.
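
A quick check along these lines can be done with numpy; the synthetic matrix below stands in for real sentence embeddings and its parameters are purely illustrative.

```python
import numpy as np

def standardization_report(embeddings):
    """Overall mean and standard deviation of the embedding matrix;
    standardized embeddings would give mean close to 0 and std close to 1."""
    emb = np.asarray(embeddings, dtype=float)
    return float(emb.mean()), float(emb.std())

# synthetic stand-in for real sentence embeddings (100 vectors, 384 dims)
mean, std = standardization_report(
    np.random.default_rng(0).normal(0.2, 0.05, size=(100, 384))
)
```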

Based on these values, our embeddings are not standardized. We will keep this in mind when using the embeddings for downstream tasks.

Samples of most similar sentences

To assess embedding quality, we calculate cosine similarity, which measures the semantic closeness between vectors. For example, we use a reference sentence and calculate its similarity with other sentences in the dataset. The top-5 most similar sentences are retrieved, showcasing the model's ability to group semantically related content.
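
The retrieval step can be sketched with plain numpy (the helper name and toy vectors are illustrative):

```python
import numpy as np

def top_k_similar(query_vec, embeddings, k=5):
    """Indices and scores of the k embeddings most similar to the
    query under cosine similarity, best first."""
    emb = np.asarray(embeddings, dtype=float)
    q = np.asarray(query_vec, dtype=float)
    sims = emb @ q / (np.linalg.norm(emb, axis=1) * np.linalg.norm(q) + 1e-12)
    order = np.argsort(sims)[::-1][:k]
    return order, sims[order]

vecs = [[1, 0], [0.9, 0.1], [0, 1], [-1, 0]]
idx, sims = top_k_similar([1, 0], vecs, k=2)
```

With real data, `embeddings` would be the matrix returned by the sentence-transformer's encode step and `query_vec` the embedding of the reference sentence.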

This section helps us understand how well the embeddings capture sentence meaning and semantic relationships.

In conclusion, the sentences most similar to a reference sentence under cosine similarity are indeed semantically related, demonstrating that the sentence embeddings capture the underlying meaning of the text.

Furthest and Closest Sentences

Quantitative Exploration of Sentence Similarities

To understand the quality of embeddings generated by our model, we perform a quantitative analysis by exploring the distribution of cosine similarities and pointwise distances within the same class (for example "Label 1" to "Label 1" and "Label 0" to "Label 0"). This helps assess how well the embeddings represent semantic similarities within each class.

Analysis Steps

  1. Cosine Similarities:

    • Pairwise cosine similarities are computed for sentence embeddings within the same label group.

    • This quantifies the semantic similarity between embeddings, with values ranging from -1 (completely opposite) to 1 (identical).

  2. Pointwise Distances:

    • Calculated as (1 - cosine similarity), the pointwise distance provides a dissimilarity measure between embeddings.

    • Lower distances indicate closer embeddings, reflecting better semantic cohesion.

  3. Density Plot:

    • The KDE-plot visualizes the distribution of cosine similarities for each label, providing insights into class separability and cohesion.
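
Steps 1 and 2 can be sketched as follows (names and toy vectors are illustrative; the KDE plot in step 3 would consume the returned arrays):

```python
import numpy as np

def within_class_similarities(embeddings, labels, label):
    """Pairwise cosine similarities among embeddings of one class
    (upper triangle only, so each pair counts once); the corresponding
    pointwise distance is simply 1 - similarity."""
    emb = np.asarray(embeddings, dtype=float)[np.asarray(labels) == label]
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = (unit @ unit.T)[np.triu_indices(len(emb), k=1)]
    return sims, 1.0 - sims

sims, dists = within_class_similarities(
    [[1, 0], [2, 0], [0, 1]], [1, 1, 0], label=1
)
# sims for each label could then be passed to e.g. seaborn's kdeplot
```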

Observations from the Plot

Key Insights

Visualize Embeddings

In this section, we examine the embeddings of the validation set using dimensionality reduction techniques (t-SNE and UMAP) to visualize high-dimensional data in 2D space. These visualizations help us understand the distribution of embeddings and identify clusters or patterns that may inform downstream tasks.
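
A minimal t-SNE sketch with scikit-learn, using random vectors as a stand-in for the real validation embeddings (shapes and parameters are illustrative):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
emb = rng.normal(size=(60, 32))  # stand-in for the validation embeddings

# project to 2D; perplexity must stay below the number of samples
coords = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(emb)

# coords can then be scatter-plotted and colored by label; UMAP
# (umap-learn's umap.UMAP) exposes the same fit_transform interface
```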

Observations from the t-SNE and UMAP plots

The t-SNE embedding visualization demonstrates overlapping clusters for labels 0 and 1, indicating that the embedding space does not perfectly separate the classes.

The UMAP embedding visualization shows that the classes (Label 0 and Label 1) are largely overlapping in the reduced 2D space, similar to the t-SNE method.

In conclusion, these overlaps could be due to the complexity of the sentiment classification task or, more probably, to the limitations and simplicity of the embedding model. Further steps could include switching to a stronger embedding model to achieve better separability.

Visualize Embeddings with Different Embedding Model

Since the current embeddings do not perfectly separate the classes, we can try a different embedding model. Based on the model overview at sbert.net, we use the all-mpnet-base-v2 model to generate embeddings, visualize them with t-SNE and UMAP, and check whether class separability improves.

We can see that all-mpnet-base-v2 produces higher-dimensional embeddings than the previous model (768 vs. 384 dimensions), which means the embeddings can be more expressive and capture more information about the sentences. This could potentially lead to better class separability in the visualization.

The t-SNE visualization using the all-mpnet-base-v2 model shows better clustering compared to the previous model. We can clearly see some grouping patterns. However, there is still some overlap between the two classes, indicating challenges in complete separation of semantic representations for different labels.

The UMAP embedding visualization shows similar results to the t-SNE method.

In conclusion, the all-mpnet-base-v2 model provides better separability between classes compared to the previous model and we can consider using it for downstream tasks.


4. Weak Labeling

Weak labeling assigns pseudo-labels to unlabeled data by leveraging embedding-based similarities and predefined techniques. The goal is to create meaningful labels for unlabeled datasets, enabling their use in downstream tasks such as training or evaluation.

Techniques for Weak Labeling

  1. Majority Vote: Assigns the most frequent label among the top-k most similar labeled sentences (based on cosine similarity of embeddings).

  2. Weighted Voting: Weights each of the top-k neighbor labels by their similarity score. The label with the highest cumulative weighted score is chosen as the weak label. This approach typically provides a finer-grained measure of neighborhood agreement and often outperforms simple majority voting.

  3. Centroid-Based Labeling: Computes class centroids in the embedding space using the training set. Each unlabeled sample is assigned the label of the closest centroid. This provides a fast and scalable approach, though it may be less nuanced than neighbor-based methods.
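
The three techniques above can be sketched in a single numpy helper, assuming binary labels 0/1 (the function name and toy data are illustrative, not the project's actual code):

```python
import numpy as np

def weak_label(unlabeled_emb, labeled_emb, labels, k=5, method="weighted"):
    """Weak labels via k-NN majority vote, similarity-weighted vote,
    or nearest class centroid, all in cosine space."""
    U = np.asarray(unlabeled_emb, dtype=float)
    L = np.asarray(labeled_emb, dtype=float)
    U = U / np.linalg.norm(U, axis=1, keepdims=True)
    L = L / np.linalg.norm(L, axis=1, keepdims=True)
    y = np.asarray(labels)
    if method == "centroid":
        cents = np.stack([L[y == c].mean(axis=0) for c in (0, 1)])
        cents = cents / np.linalg.norm(cents, axis=1, keepdims=True)
        return (U @ cents.T).argmax(axis=1)      # closest centroid wins
    sims = U @ L.T
    nn = np.argsort(sims, axis=1)[:, -k:]        # top-k labeled neighbors
    preds = np.empty(len(U), dtype=int)
    for i, idx in enumerate(nn):
        if method == "majority":
            votes = np.bincount(y[idx], minlength=2).astype(float)
        else:                                    # weighted by similarity
            votes = np.zeros(2)
            np.add.at(votes, y[idx], sims[i, idx])
        preds[i] = votes.argmax()
    return preds

unlab = [[1.0, 0.05], [0.05, 1.0]]
lab = [[1, 0], [0.9, 0.1], [0.95, 0.05], [0, 1], [0.1, 0.9], [0.05, 0.95]]
preds = weak_label(unlab, lab, [0, 0, 0, 1, 1, 1], k=3)
```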

Comparison and Technique Selection

We evaluate the three techniques on the validation set to determine the best performing method. After computing metrics such as accuracy, precision, recall, and F1-score, we select the approach that achieves the highest F1-score, as it balances precision and recall and is often most indicative of overall performance in classification tasks.

In the plot above, we can see how each technique predicted the labels. Majority Vote and Weighted Voting behave very similarly, while Centroid-Based Labeling assigned Label 0 to more samples.

Confidence-Based Selection for Data Augmentation

Once the best technique is identified (in this case, Weighted Voting), we apply it to the unlabeled pool. However, rather than adding all weakly labeled data—some of which might be noisy—we select only the top 5'000 samples with the highest confidence scores. For Weighted Voting, this confidence is the margin between the chosen label’s cumulative weight and that of the runner-up label. High-confidence samples are more likely to be correctly labeled and thus can enhance the quality of the augmented training data.

By incorporating these top 5'000 confident weak labels into our original 5'000-sample training set, we end up with a 10'000-sample training set that can potentially improve the robustness and generalizability of our final model.
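
The margin-based selection could be sketched like this (the helper name and per-class weight values are illustrative; with real data, `class_weights` would hold each sample's cumulative per-label similarity weights from Weighted Voting):

```python
import numpy as np

def select_most_confident(class_weights, n_top):
    """Confidence = margin between the winning label's cumulative
    weight and the runner-up's; keep the n_top largest margins."""
    w = np.asarray(class_weights, dtype=float)  # shape (n_samples, n_classes)
    top2 = np.sort(w, axis=1)[:, -2:]           # two largest weights per row
    margin = top2[:, 1] - top2[:, 0]
    keep = np.argsort(margin)[::-1][:n_top]     # most confident first
    return keep, w.argmax(axis=1)[keep]

idx, labels = select_most_confident(
    [[4.8, 0.2], [2.6, 2.4], [0.5, 4.5]], n_top=2
)
```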

Sample of Weakly Labeled Sentences

Here we manually check a sample of weakly labeled sentences to assess the quality of the weak labeling technique. By examining the sentences and their assigned labels, we may identify potential issues or errors.

Positive sentiments (Label: 1) are associated with praise or admiration, while negative sentiments (Label: 0) align with criticism or dismissive tones.

The method appears to reasonably capture the polarity of sentiments based on contextual cues in the text, but is not perfect and may misclassify some sentences due to ambiguity or complexity.


5. Model training with additional weak labels

We will extend the baseline training code to incorporate both hard labels (from the labeled data) and weak labels. This will involve:

  1. Combining the hierarchically nested training splits with the weakly labeled data.
  2. Training and evaluating the model for each split size.
  3. Comparing the results to the baseline metrics.
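
Step 1 can be sketched as follows (the function name and toy data are illustrative); shuffling after concatenation ensures training batches mix both label sources:

```python
import random

def combine_with_weak(hard_texts, hard_labels, weak_texts, weak_labels, seed=0):
    """Concatenate a hard-labeled split with the weakly labeled pool
    and shuffle so batches mix both label sources."""
    pairs = list(zip(hard_texts, hard_labels)) + list(zip(weak_texts, weak_labels))
    random.Random(seed).shuffle(pairs)
    texts, labels = zip(*pairs)
    return list(texts), list(labels)

texts, labels = combine_with_weak(["a", "b"], [1, 0], ["c"], [1])
```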

When the training set is extremely small (1–5%), the model likely defaults to predicting a single class, causing all metrics to appear fixed and uninformative. As the training size grows (above ~7%), the model begins to learn meaningful patterns, leading to diverging and improving metrics. In other words, early stagnation of metrics indicates no real learning; once enough data is provided, the model’s performance metrics start to reflect true predictive capability. At around 30% of the training data, the model reaches good performance.


6. Model comparison

This section compares the baseline model performance with the model trained on weakly labeled data. We also evaluate the weak labels directly to determine if they are sufficient for sentiment classification or if training a classification model using these labels is more effective.

  1. Compare Baseline and Weak Label Models
  2. Evaluate Weak Labels Directly
  3. Decide on the Best Approach

Decide on the Best Approach

Observations:

Conclusion:

Training a classification model with weak labels does not seem worthwhile, as it does not improve performance over the baseline model. While weak labels offer a cost-effective alternative to manual annotation and deliver acceptable results, they fail to boost the classification model's performance sufficiently. Therefore, no time-savings factor was calculated, as the weak labeling approach did not achieve the expected improvements in model performance.

General Conclusion

This study highlights the challenges and potential of integrating weak labeling into a classification workflow. While the baseline model slightly outperformed the weak label model in most cases, the difference in performance was not substantial. This suggests that, under different circumstances (such as improved weak label generation strategies or larger datasets) the use of weak labels might provide significant advantages. Weak labeling remains a promising approach for reducing manual annotation efforts and could be particularly beneficial when dealing with limited hard labels or resource constraints. Further exploration of more robust weak labeling techniques and their application to diverse datasets could unlock their full potential in enhancing classification performance.


Use of AI

In this project, ChatGPT was used extensively to support the implementation and documentation of the sentiment analysis pipeline. The following tasks were accomplished using the tool:

  1. Code Implementation:

    • Developing weak labeling strategies (majority vote, weighted vote, centroid-based).
    • Calculating cosine similarities between embeddings and selecting samples with high similarity scores.
    • Designing a training loop and integrating weak labels into the model pipeline.
  2. Optimization and Debugging:

    • Refining code to ensure dataset consistency and compatibility with project requirements.
    • Streamlining processes to maintain a balanced training set size of 10'000 samples.
  3. Documentation:

    • Adding detailed comments to complex code sections to enhance readability and reproducibility.

Assessment of Prompting Strategies

Prompting Strategies Used

  1. Task-Specific Prompts:

    • Example: "Generate code to calculate cosine similarity and select the top-k most similar samples."
    • Effectiveness: Delivered targeted, actionable code snippets with minimal modifications.
  2. Iterative Refinement:

    • Example: Initial prompts provided a general framework, followed by refinements to address specific requirements or edge cases.
    • Effectiveness: Allowed for precise tailoring of solutions to meet project goals.

Most Successful Strategy

Reflection on AI Tool Usage

The use of ChatGPT significantly accelerated the development process, especially for repetitive or complex tasks. By leveraging AI, we could focus more on strategic decisions and evaluation, while routine coding tasks were handled efficiently. Moreover, the iterative refinement process fostered a deeper understanding of key concepts in weak labeling and sentiment analysis.

The dual role of ChatGPT as both a problem-solving assistant and a learning tool proved invaluable, contributing to the success of this project and the acquisition of advanced skills in natural language processing.